OOV Detection and Recovery Using Hybrid Models with Different Fragments

Authors

  • Long Qin
  • Ming Sun
  • Alexander I. Rudnicky
Abstract

In this paper, we address the out-of-vocabulary (OOV) detection and recovery problem by developing three different fragment-word hybrid systems. A fragment language model (LM) and a word LM were trained separately and then combined into a single hybrid LM. Using this hybrid model, the recognizer can recognize any OOV word as a sequence of fragments. Different types of fragments, such as phones, subwords, and graphones, were tested and compared on the WSJ 5k and 20k evaluation sets. The experimental results show that the subword and graphone hybrid systems perform better than the phone hybrid system on both the 5k and 20k tasks. Furthermore, given less training data, the subword hybrid system is preferable to the graphone hybrid system.
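The detection step implied by the abstract can be illustrated with a minimal sketch: once a hybrid recognizer emits a mix of words and fragments, each run of consecutive fragments is a candidate OOV region. The `+` marker on fragment tokens, the function names, and the sample hypothesis are assumptions made for illustration only; the abstract does not specify how the authors tag fragments or post-process the output.

```python
from typing import List, Tuple

# Hypothetical convention: fragment tokens carry a "+" prefix.
# Real hybrid systems define their own way of marking fragments.
FRAGMENT_MARK = "+"


def is_fragment(token: str) -> bool:
    """Return True if the token is tagged as a fragment (assumed convention)."""
    return token.startswith(FRAGMENT_MARK)


def strip_mark(tokens: List[str]) -> List[str]:
    """Remove the fragment marker from each token."""
    return [t[len(FRAGMENT_MARK):] for t in tokens]


def find_oov_regions(hypothesis: List[str]) -> List[Tuple[int, int, List[str]]]:
    """Group each run of consecutive fragment tokens into one OOV candidate.

    Returns (start, end, fragments) triples; end is exclusive.
    """
    regions = []
    start = None
    for i, token in enumerate(hypothesis):
        if is_fragment(token):
            if start is None:
                start = i  # a new fragment run begins here
        elif start is not None:
            regions.append((start, i, strip_mark(hypothesis[start:i])))
            start = None
    if start is not None:  # the hypothesis ended inside a fragment run
        regions.append((start, len(hypothesis), strip_mark(hypothesis[start:])))
    return regions


# Made-up hypothesis with one phone-like fragment run.
hyp = ["the", "+K", "+AA", "+R", "+N", "+IY", "visited", "boston"]
regions = find_oov_regions(hyp)
```

Recovery would then map each fragment sequence back to a spelling, for example through a phoneme-to-grapheme step; that step is outside this sketch.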

Similar resources

OOV Word Detection using Hybrid Models with Mixed Types of Fragments

This paper presents initial studies to improve the out-of-vocabulary (OOV) word detection performance by using mixed types of fragment units in one hybrid system. Three types of fragment units, subwords, syllables, and graphones, were combined in two different ways to build the hybrid lexicon and language model. The experimental results show that hybrid systems with mixed types of fragment units...

Learning Out-of-Vocabulary Words in Automatic Speech Recognition

Out-of-vocabulary (OOV) words are unknown words that appear in the testing speech but not in the recognition vocabulary. They are usually important content words such as names and locations which contain information crucial to the success of many speech recognition tasks. However, most speech recognition systems are closed-vocabulary recognizers that only recognize words in a fixed finite vocab...

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...

OOV-detection in large vocabulary system using automatically defined word-fragments as fillers

The problem of unknown words has been addressed using automatically generated filler fragments which augment the lexicon and are incorporated in the language model. These fragments are used to reduce the damage on in-vocabulary words, to detect OOV regions and to provide a phonetic transcription for these regions. The performance of this technique has been evaluated in terms of damage reduction e...

Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition (The Johns Hopkins University)

Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. We present a novel probabilistic model to l...

Publication date: 2011